eggNOG-mapper initial test analysis (2025-01-10 to 2025-01-14)

Introduction

Aaron emailed me back with advice on a way to make eggNog-mapper/2.1.12 on hawk work. As I am revising for an exam I can't spend too long on this. P.S. I think I did pretty well on the exam.

Methods

From the 10th to the 12th I tried running a [[test file]] I made earlier in December 2024. I may not have written about it, as I didn't get anywhere at that time. This process was also hindered by hawk being very busy in the new year, meaning it takes a good half day for any results to appear. On the 11th I got a successful result for accession 3Dt1c, including the .xlsx file I need for the heatmaps. Today (the 12th) I set off 3 jobs, each containing 3 accessions, to hopefully get the results of the remaining 9. If there are no complications I will then be in a position to obtain the .fastas for the online comparison accessions and run them through eggNOG-mapper. It should be noted that I used the same list of parameters I found on the web version of eggNOG-mapper for this, [[screenshot attached]].

I had some marginal success: the second set of three finished in over 3 hours, while the other 2 sets are still going strong after 8, so I'll let them run overnight and see what I have. Maybe they will complete, as I did give them 12 hours.

I then discovered the dbmem option (--dbmem), which could help to speed up the process. I tested it with [[this file]], set to run overnight from roughly 9:30 pm on the 12th to 3:30 am on the 13th, totalling 6 hours for 5 files, not great. I then experimented with taking out some of the arguments (--evalue 0.001 --score 60 --pident 40 --query_cover 20 --subject_cover 20), which produced [[this script]]. Tomorrow (14th) I will have a look at recreating run 1 with dbmem switched on.
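For reference, the three runs differ only in which options are passed to eggNOG-mapper, so the command can be assembled from a couple of switches. The snippet below is a minimal sketch, not the actual script linked above: the function name, the input path and the output directory are hypothetical, and it assumes protein FASTA input and that `--excel` is what produces the .xlsx file. It only builds the argument list and does not run anything.

```python
from pathlib import Path

# Filter thresholds quoted in the notes above (the ones removed for run 3).
FILTER_ARGS = {
    "--evalue": "0.001",
    "--score": "60",
    "--pident": "40",
    "--query_cover": "20",
    "--subject_cover": "20",
}


def build_emapper_command(fasta, out_dir, cpus=10, use_dbmem=True, use_filters=True):
    """Return the emapper.py argument list for one accession (does not run it)."""
    fasta = Path(fasta)
    cmd = [
        "emapper.py",
        "-i", str(fasta),
        "-o", fasta.stem,              # output prefix, here the accession name
        "--output_dir", str(out_dir),
        "--cpu", str(cpus),
        "--excel",                     # assumption: this is how the .xlsx is produced
    ]
    if use_dbmem:
        cmd.append("--dbmem")          # load the annotation database into memory
    if use_filters:
        for flag, value in FILTER_ARGS.items():
            cmd += [flag, value]
    return cmd


if __name__ == "__main__":
    # Hypothetical input path; 3Dt1c is the accession mentioned above.
    print(" ".join(build_emapper_command("proteins/3Dt1c.fasta", "emapper_out")))
```

Calling it with `use_filters=False` gives the run 3 style command, and `use_dbmem=False` gives the run 1 style command.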

Results

Run 1 had mixed results, with 3 sets of 3 run in parallel. Set 1 completed in 3 and a half hours, good. Set 2 took 8 and a half hours, bad. Set 3 timed out, very bad, so I don't have the outputs needed for 1Dt100h or 1Dt1h. Run 2 was more successful, with 5 solid-looking outputs in 6 hours. Run 3 did not improve on that time despite the extra parameters being cut.

Conclusion

It could have been overcrowding on hawk, but it appears that running multiple sets in parallel adversely affects the result. I have not yet tested this with dbmem on. The larger problem is the volume of samples required. The desired output is 3 heatmaps comparing KO pathways:
- Comparing genera inside Sphingomonadaceae
- Comparing genera inside Microbacteriaceae
- Comparing the genera containing just our samples

Additional pre-eggnog analysis + API (2025-01-17 to 2025-01-__)

Introduction

Hawk is still running slow, so while the jobs from the previous section are running, I figured I would do something extra to fill the time. I wanted to see just how many samples we are going to need to pull down and process, and get a rough time estimate for that. I also wanted to look at the API calls for pulling the files down, as that is by far the easy bit.

Methods

I started by making tables of the sets of samples I am going to need from the NCBI website. I identified 3 groups of samples:
1. all the genera in the family Sphingomonadaceae
2. all the genera in the family Microbacteriaceae
3. all the genera containing our flye_asm samples

This fits with the specification of work I was given, which is 2 analyses: 1 comparing just our genera, and another comparing genera in the families we have multiple samples in. As of now I am yet to do the API call.

(Still the 17th) I decided to run some more tests to try to cut the time down by adding some more options to my [[anothergo.sh]] script, [[specifically____]].

I was having trouble creating the slurm scripts on hawk, as it was a lot of typing complex strings, so I made another script maker to automate that, which will be much more convenient when working at scale, [[here]].
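As an illustration of the script-maker idea, the snippet below fills in a slurm template for one set of accessions. It is a rough sketch and not the script linked above: the function and template names are hypothetical, the module name is assumed from the version mentioned earlier (eggNog-mapper/2.1.12), the emapper options are kept to a minimum, and any account or partition lines that hawk requires are left out.

```python
from pathlib import Path

# Assumed layout of a job script; account/partition lines are omitted on purpose.
SLURM_TEMPLATE = """#!/bin/bash
#SBATCH --job-name=emapper_{name}
#SBATCH --ntasks=1
#SBATCH --cpus-per-task={cpus}
#SBATCH --time={hours:02d}:00:00
#SBATCH --output=emapper_{name}.%j.log

module load {module}

{commands}
"""


def make_slurm_script(name, fastas, cpus=10, hours=12, module="eggNog-mapper/2.1.12"):
    """Return the text of a slurm script that runs emapper.py on each FASTA in turn."""
    commands = "\n".join(
        f"emapper.py -i {fasta} -o {Path(fasta).stem} --cpu {cpus} --dbmem"
        for fasta in fastas
    )
    return SLURM_TEMPLATE.format(
        name=name, cpus=cpus, hours=hours, module=module, commands=commands
    )


if __name__ == "__main__":
    # Hypothetical set name and input paths.
    script = make_slurm_script("set1", ["fastas/3Dt1c.fasta", "fastas/1Dt1h.fasta"])
    Path("emapper_set1.sh").write_text(script)
    print(script)
```

Generating one script per set keeps the typing down to a list of file paths, and the wall time and CPU count can be changed in one place when testing.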

Results

All tables

Family Sphingomonadaceae

Table 10 - count of genera in the family Sphingomonadaceae with 30+ accessions
accessions as found in the .tree file outputted by gtdbtk analysis done on 2024-12-24
Family and Genus Number of Accessions
f__Sphingomonadaceae
g__Erythrobacter 66
g__Novosphingobium 115
g__Sphingobium 77
g__Sphingomicrobium 38
g__Sphingomonas 205
g__Sphingopyxis 62
Total 563

Family Microbacteriaceae

Table 11 - count of genera in the family Microbacteriaceae
accessions as found in the .tree file outputted by gtdbtk analysis done on 2024-12-24
Family and Genus Number of Accessions
f__Microbacteriaceae
g__73-13 2
g__Agreia 6
g__Agrococcus 16
g__Agromyces 41
g__Agromyces_B 1
g__Alpinimonas 1
g__Amnibacterium 2
g__Aquiluna 15
g__Aurantimicrobium 3
g__CAIOLM01 1
g__Canibacter 4
g__Chryseoglobus 8
g__Clavibacter 17
g__Cnuibacter 1
g__Compostimonas 1
g__Conyzicola 3
g__Cryobacterium 43
g__Cryobacterium_C 1
g__Curtobacterium 51
g__Cx-87 1
g__Diaminobutyricibacter 1
g__Diaminobutyricimonas 2
g__Frigoribacterium 15
g__Frondihabitans 5
g__Galbitalea 2
g__Glaciibacter 1
g__Glaciihabitans 2
g__Gryllotalpicola 3
g__Gulosibacter 9
g__Herbiconiux 7
g__Homoserinimonas 4
g__Humibacter 4
g__JAAFHU01 1
g__JAFIQW01 1
g__Klugiella 1
g__Labedella 4
g__Lacisediminihabitans 5
g__Leifsonia 19
g__Leifsonia_A 4
g__Leifsonia_B 1
g__Leucobacter 43
g__Lumbricidophila 1
g__Lysinibacter 1
g__MWH-TA3 7
g__Marinisubtilis 3
g__Marisediminicola 4
g__Microbacterium 254
g__Microbacterium_A 4
g__Microcella 3
g__Microterricola 7
g__Mycetocola 3
g__Mycetocola_A 5
g__Mycetocola_B 1
g__NC76-1 1
g__Naasia 4
g__OACT-916 1
g__Okibacterium 2
g__Planctomonas 2
g__Plantibacter 6
g__Pontimonas 10
g__Protaetiibacter 9
g__Pseudoclavibacter 9
g__Pseudoclavibacter_A 3
g__Pseudolysinimonas 5
g__RFQD01 2
g__Rathayibacter 22
g__Rhodoglobus 15
g__Rhodoluna 35
g__Root112D2 1
g__SCRE01 1
g__Schumannella 4
g__Subtercola 9
g__Terrimesophilobacter 3
g__Tropheryma 1
g__UBA3913 2
g__UBA963 5
g__WSTA01 2
g__Yonghaparkia 5
g__ZJ450 2
Total 806

Our Genera

Table 12 - count of genera from accessions produced at Bangor
accessions as found in the .tree file outputted by gtdbtk analysis done on 2024-12-24
Genus Number of Accessions
g__ 1
g__Brachybacterium 32
g__Brevibacterium 43
g__Microbacterium 254
g__Pantoea 52
g__Sphingomonas 205
Total 587

There are 887 accessions in Sphingomonadaceae (across all genera; Table 10 lists only the 563 in genera with 30+ accessions), 806 in Microbacteriaceae and 587 in just our genera. However, there is overlap: Sphingomonas (205) and Microbacterium (254) are already counted in their respective family totals, so our genera contribute only 587 - 205 - 254 = 128 “unique” samples. This still includes 1Dt100h, the sample that could not even be given a family. Discounting this odd sample, which I cannot logically analyse, leaves 127 accessions. This brings me to a total of 887 + 806 + 127 = 1,820 accessions that need to be processed. The fastest I have been able to process samples on eggnog is roughly 1 hour and 10 minutes per sample. 1,820 x 1.1667 = 2,123.4 hours. This means that if I could run them constantly, it would take over 88 days, so nearly 3 months, just to run the samples through eggnog. Through testing I have already made a brief start, with maybe 20 done.
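The accession count can be re-derived from the family totals and Table 12. This is just a quick check of the arithmetic, with variable names of my own choosing; the 887 figure is taken from the text, as it does not appear in a table.

```python
# Family totals quoted in the text (all genera).
family_totals = {"Sphingomonadaceae": 887, "Microbacteriaceae": 806}

# Table 12: genera containing our own samples.
our_genera = {
    "g__": 1,  # accession with no genus assigned
    "g__Brachybacterium": 32,
    "g__Brevibacterium": 43,
    "g__Microbacterium": 254,
    "g__Pantoea": 52,
    "g__Sphingomonas": 205,
}

# Genera already included in one of the two family totals.
already_counted = {"g__Microbacterium", "g__Sphingomonas"}

unique_ours = sum(n for genus, n in our_genera.items() if genus not in already_counted)
usable_ours = unique_ours - 1  # drop the one sample that cannot be analysed
total_accessions = sum(family_totals.values()) + usable_ours

print(unique_ours, usable_ours, total_accessions)
```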

Right, as of 11:12 today (17th) I have managed a successful run using [[this file]] that gave me the output in a nice 25 minutes. I added some extra options that I saw both on the online eggNOG-mapper and in a Stack Overflow post from someone doing a similar thing, and something worked. Using this new number, 1,820 x 0.4167 = 758.4 hours = 31.6 days, so even now I'm down to just over a month, good stuff. This run used 10 CPUs; the online one used 20, but I didn't want to overdo it on hawk. Now I'm thinking, can I halve that number again by doubling the CPUs?
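The two time estimates use the same sum, so a small helper makes it easy to try other per-sample times. This is a sketch: the `parallel_jobs` argument is my own addition and assumes parallel jobs do not slow each other down, which run 1 suggests may not hold on hawk. Working from minutes rather than the rounded 1.1667 and 0.4167 factors changes the last digit of the hour totals slightly.

```python
def total_runtime(n_accessions, minutes_per_accession, parallel_jobs=1):
    """Return (hours, days) to process every accession back to back."""
    hours = n_accessions * minutes_per_accession / 60 / parallel_jobs
    return hours, hours / 24


if __name__ == "__main__":
    for label, minutes in [("first working run", 70), ("run on the 17th", 25)]:
        hours, days = total_runtime(1820, minutes)
        print(f"{label}: {hours:.1f} hours = {days:.1f} days")
```

If doubling the CPUs really did halve the per-sample time, that would be `total_runtime(1820, 12.5)`, but that is untested.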

Conclusion

This number of hours is not one I can realistically work with. Maybe Aaron would be fine with me taking that time, but I feel I can do it faster. I am still working on improving the script to cut down the time taken, but hawk is being a pain in that regard. I have also started thinking of alternatives or ways to cut down the list. My front-running idea is to take samples only from genera with more than 10, or 30, or some other arbitrary high number of samples. This would cut out all groups with only 1-4 accessions in them, which I don't think are useful to this analysis anyway as a meaningful average cannot be calculated. It would also make the heatmaps much more legible, so I might run the numbers on what cutting down the sample size would look like. It may, however, be better to group all those small genera into an “other” column to make the dataset as big as possible, but that doesn't solve my predicament.
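To run the numbers on a cut-off, the genus counts can be split into the genera that are kept and a pooled “other” count. The snippet below is a sketch using only an excerpt of Table 11 (12 of the genera), so the “other” figure it prints covers that excerpt only and not the whole family. The function name is hypothetical, and it treats the cut-off as "at least N" accessions.

```python
def split_by_threshold(counts, min_accessions):
    """Split genus counts into (kept genera, pooled count of the rest)."""
    kept = {genus: n for genus, n in counts.items() if n >= min_accessions}
    other = sum(n for n in counts.values() if n < min_accessions)
    return kept, other


# Excerpt of Table 11 (Microbacteriaceae), not the full table.
microbacteriaceae_excerpt = {
    "g__Agreia": 6,
    "g__Agrococcus": 16,
    "g__Agromyces": 41,
    "g__Clavibacter": 17,
    "g__Cryobacterium": 43,
    "g__Curtobacterium": 51,
    "g__Leifsonia": 19,
    "g__Leucobacter": 43,
    "g__Microbacterium": 254,
    "g__Rathayibacter": 22,
    "g__Rhodoluna": 35,
    "g__Tropheryma": 1,
}

if __name__ == "__main__":
    for threshold in (10, 30):
        kept, other = split_by_threshold(microbacteriaceae_excerpt, threshold)
        print(threshold, len(kept), sum(kept.values()), other)
```

The same function covers both options in the paragraph above: drop the `other` value to cut the list down, or keep it as a single extra heatmap column.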

📌 TODO: eggnog-mapper on hawk is slow, possibly too slow to scale to where we need it to be. Alternatives:
- what scale are we looking at: 1693 samples for the family comparison
- could limit to genera with more than 30 samples
- screenscrape method
- enlist more manpower to do it online (if we have to do it on the web)
- look at de_novo to see what species each sample was closest to

Full numbers

Assuming I want to limit the family analysis to larger groups (more than 30), the total samples are:

Table 13 - count of samples in all genera of interest, by family
accessions as found in the .tree file outputted by gtdbtk analysis done on 2024-12-24
Family and Genus Number of Accessions
Grand Total 1157
f__Brevibacteriaceae
g__Brevibacterium 43
Total 43
f__Dermabacteraceae
g__Brachybacterium 32
Total 32
f__Enterobacteriaceae
g__Pantoea 52
Total 52
f__Microbacteriaceae
g__Agromyces 41
g__Cryobacterium 43
g__Curtobacterium 51
g__Leucobacter 43
g__Microbacterium 254
g__Rhodoluna 35
Total 467
f__Sphingomonadaceae
g__Erythrobacter 66
g__Novosphingobium 115
g__Sphingobium 77
g__Sphingomicrobium 38
g__Sphingomonas 205
g__Sphingopyxis 62
Total 563
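The family subtotals and grand total in Table 13 can be checked by summing the genus counts. This is a small sketch with the table typed in as a nested dictionary; the variable names are my own.

```python
# Table 13: genera of interest (30+ accessions), grouped by family.
table_13 = {
    "f__Brevibacteriaceae": {"g__Brevibacterium": 43},
    "f__Dermabacteraceae": {"g__Brachybacterium": 32},
    "f__Enterobacteriaceae": {"g__Pantoea": 52},
    "f__Microbacteriaceae": {
        "g__Agromyces": 41,
        "g__Cryobacterium": 43,
        "g__Curtobacterium": 51,
        "g__Leucobacter": 43,
        "g__Microbacterium": 254,
        "g__Rhodoluna": 35,
    },
    "f__Sphingomonadaceae": {
        "g__Erythrobacter": 66,
        "g__Novosphingobium": 115,
        "g__Sphingobium": 77,
        "g__Sphingomicrobium": 38,
        "g__Sphingomonas": 205,
        "g__Sphingopyxis": 62,
    },
}

family_subtotals = {family: sum(genera.values()) for family, genera in table_13.items()}
grand_total = sum(family_subtotals.values())

for family, subtotal in family_subtotals.items():
    print(family, subtotal)
print("Grand Total", grand_total)
```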